In this vignette, we will introduce how to use the R package LinCDE for conditional density estimation.
The density of a continuous random variable characterizes its relative likelihood of taking a specific value. In statistics, many questions are essentially questions of density characteristics, e.g., location, variance, modality, and skewness. A fruitful collection of methods has been proposed for density estimation, such as kernel density estimation and Lindsey’s method.
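As a quick point of reference, base R’s density() function implements kernel density estimation; the sketch below (our own illustration with simulated standard Gaussian data) recovers a density peaked near zero.

```r
# Kernel density estimation with base R (illustrative, simulated data)
set.seed(1)
y = rnorm(500)              # standard Gaussian sample
kde = density(y)            # Gaussian kernel, automatic bandwidth selection
kde$x[which.max(kde$y)]     # estimated mode, close to 0
```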
In practice, the density of interest is often heterogeneous across observations. For instance, the mean height increases from children to adults, the density of food expenditure is more variable among higher-income communities than among lower-income communities, and the density of salary is bi-modal among lawyers but unimodal among firefighters. The heterogeneity in conditional densities often carries meaningful messages.
We propose LinCDE boost for conditional density estimation: a boosting algorithm built on LinCDE trees. A LinCDE tree partitions the covariate space into subregions with approximately locally homogeneous densities, employs Lindsey’s method for density estimation in each subregion, and aggregates the subregion densities as the estimated conditional density. LinCDE boost grows a number of LinCDE trees in a forward stagewise manner, each tree fitting the residuals of the previous estimate. For more details, please refer to the LinCDE paper.
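To give intuition for the building block, here is a minimal, self-contained sketch of Lindsey’s method (our own illustration, not the package’s internal code): discretize the response into bins, fit a Poisson regression of the bin counts on a spline basis of the bin midpoints, and normalize the fitted means into a density estimate.

```r
# Lindsey's method: density estimation via Poisson regression on bin counts
set.seed(1)
y = rnorm(1000)
nBins = 30
breaks = seq(min(y), max(y), length.out = nBins + 1)
mids = (breaks[-1] + breaks[-(nBins + 1)]) / 2            # bin midpoints
counts = as.vector(table(cut(y, breaks, include.lowest = TRUE)))
fit = glm(counts ~ splines::ns(mids, df = 5), family = poisson())
binWidth = diff(breaks)[1]
dens = fitted(fit) / sum(fitted(fit)) / binWidth          # normalized density
sum(dens * binWidth)                                      # integrates to 1
```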
We illustrate the workflow of the LinCDE package using a toy example. First, let us install and attach the LinCDE package.
library("LinCDE")
Next, we prepare the example dataset. The LinCDE boost estimator requires a covariate matrix and a response vector. In this example, we generate \(20\) covariates uniformly from \([-1,1]\). Given a covariate value, the response is locally Gaussian. The response depends on the covariates through its mean and variance. In particular, \(X_1\) and \(X_2\) influence the response’s mean and \(X_2\) also influences the response’s variance. The lattice plot below visualizes the conditional densities at \(9\) different feature points.
# locally Gaussian design (LGD)
density.LGD = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    # evaluate the conditional densities at the pairs (X, y)
    dens = dnorm(y, mean = (0.5 * X[, 1] + X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
    return(dens)
  }else{
    # sample responses from the conditional densities
    y = 0.5 * X[, 1] + X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[, 2])
    return(y)
  }
}
# feature points for visualization
d = 20; XProfile = matrix(0, nrow = 3^2, ncol = d); colnames(XProfile) = paste("X", seq(1,d), sep = "")
X1 = X2 = c(-0.6,0,0.6); XProfile[, c(1,2)] = as.matrix(expand.grid(X1, X2))
# true conditional density plots
densityPlot(X = XProfile, trueDensity = density.LGD, minY = -3, maxY = 3, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2], sep = ""))
We conduct conditional density estimation based on \(1000\) training data points. We also generate independent validation and test samples of size \(1000\) for hyper-parameter tuning and performance evaluation, respectively.
set.seed(100)
# training data
n = 1000; X = matrix(runif(n * d, -1, 1), ncol = d); colnames(X) = paste("X", seq(1,d), sep = "")
y.LGD = density.LGD(X)
# validation data for tuning
nVal = 1000; XVal = matrix(runif(nVal * d, -1, 1), ncol = d); colnames(XVal) = paste("X", seq(1,d), sep = "")
yVal.LGD = density.LGD(XVal)
# test data
nTest = 1000; XTest = matrix(runif(nTest * d, -1, 1), ncol = d); colnames(XTest) = paste("X", seq(1,d), sep = "")
yTest.LGD = density.LGD(XTest)
There are a few hyper-parameters critical to the performance of LinCDE boost. The two primary parameters of interest are the number of iterations (n.trees) and the depths of LinCDE trees (depth).
n.trees: the number of LinCDE trees to fit. We train a LinCDE boost model on the training data with a large number of trees, e.g., 100 trees. We then compute the validation log-likelihoods after each iteration on a separate validation dataset. We choose the number of iterations corresponding to the maximal validation log-likelihood.
In the following example with depth = 2, we specify the maximal number of trees as \(100\). We evaluate the validation log-likelihood and plot the curve. The log-likelihood first increases then stabilizes at \(-0.76\) after \(50\) iterations. Therefore, we choose n.trees = 50 for depth = 2.
depth: the number of splits of each LinCDE tree. We apply LinCDE boost with a grid of depths. For each depth, we choose the number of iterations as above and record the associated validation log-likelihood. We choose the depth with the maximal validation log-likelihood.
In the following example, we experiment with depth = 1 and depth = 2. For depth = 1, we choose 100 iterations, and the associated log-likelihood is \(-0.80\). For depth = 2, we choose 50 iterations, and the associated log-likelihood is \(-0.76\). Since \(-0.76 > -0.80\), we choose depth = 2. Notice that the example’s conditional mean involves an interaction term of \(X_1\) and \(X_2\), and thus depth = 1 — an additive model in the density’s exponent — is too restrictive compared to depth = 2 — a richer model including first-order interactions.
In standard boosting, deep trees are prone to overfitting. In LinCDE boost, the overfitting issue is more severe than in standard boosting because the density estimation problem at terminal nodes is more complicated than regression or classification. As a result, we do not recommend depth > 5.
model.LGD.D1 = LinCDE.boost(X = X, y = y.LGD, depth = 1, n.trees = 100, verbose = TRUE)
predict.LGD.D1 = predict(model.LGD.D1, X = XVal, y = yVal.LGD, densityOnly = FALSE)
model.LGD.D2 = LinCDE.boost(y = y.LGD, X = X, depth = 2, n.trees = 100, verbose = TRUE)
predict.LGD.D2 = predict(model.LGD.D2, X = XVal, y = yVal.LGD, densityOnly = FALSE)
There are a number of secondary hyper-parameters. We recommend starting with the default values and making changes only if certain issues are observed.
basis: the type of spline basis. Default is the cubic natural spline basis. Cubic natural splines are desirable for their flexibility, smoothness, and linear extrapolation beyond the boundary knots. However, if the conditional densities are believed to belong to a certain distribution family, a specific basis should be adopted; for example, basis = “Gaussian” restricts the fitted conditional densities to be locally Gaussian.
splineDf: the number of spline basis functions. Default is \(10\). With the natural cubic spline basis, splineDf specifies the splines’ degrees of freedom, i.e., the number of spline bases. A larger splineDf can characterize more local structure but may produce unnecessary curvature.
df: the ridge Poisson regression’s degrees of freedom. Default is \(2\). df determines the ridge regularization hyper-parameter: a smaller df corresponds to a larger regularization parameter and helps avoid computational instabilities in subregions with a limited number of observations.
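To see how a degrees-of-freedom target maps to a ridge penalty, here is an illustrative calculation (our own sketch, not the package’s internals) based on the standard effective-degrees-of-freedom formula df(\(\lambda\)) \(= \sum_j d_j^2/(d_j^2 + \lambda)\), where \(d_j\) are the singular values of the basis matrix; for ridge Poisson regression this holds approximately via the quadratic approximation to the likelihood.

```r
# Solve for the ridge penalty lambda that yields a target effective df
set.seed(1)
Z = matrix(rnorm(200), ncol = 10)                 # stand-in basis matrix
d = svd(Z)$d                                      # singular values
effectiveDf = function(lambda) sum(d^2 / (d^2 + lambda))
target = 2
lambda = uniroot(function(l) effectiveDf(l) - target,
                 interval = c(1e-8, 1e8))$root
effectiveDf(lambda)                               # approximately 2
```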
prior: type of the initial carrying density. Default is “Gaussian”, i.e., the Gaussian distribution with the marginal response mean and standard deviation is used as the universal density initialization. If you set prior = “uniform”, the uniform distribution over the response range is used. If you set prior = “LindseyMarginal”, the marginal response density estimated by Lindsey’s method from all responses is used. You can also input a homogeneous or heterogeneous conditional density function; it should take a covariate matrix \(X\) and a response vector \(y\), and output the densities at the pairs \((X, y)\). If the prior conditional density is close to the underlying truth, e.g., a pretrained conditional density estimator, LinCDE boost will require fewer iterations.
Below, the prior distribution is similar to the true conditional density. LinCDE boost takes about \(15\) iterations to converge, much faster than the \(50\) iterations with the Gaussian prior.
# input a heterogeneous conditional density function as prior
density.LGD.prior = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    # evaluate the prior conditional densities at the pairs (X, y)
    dens = dnorm(y, mean = (0.25 * X[, 1] + 0.5 * X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
    return(dens)
  }else{
    y = 0.25 * X[, 1] + 0.5 * X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[, 2])
    return(y)
  }
}
model.LGD.prior = LinCDE.boost(X = X, y = y.LGD, depth = 2, df = 2, n.trees = 100, prior = density.LGD.prior, verbose = T)
predict.LGD.prior = predict(model.LGD.prior, X = XVal, y = yVal.LGD, densityOnly = FALSE)
In the final model below, we set basis = “Gaussian”, shrinkage = 0.02, depth = 3, n.trees = 300, and leave the other parameters at default values.
model.LGD.final = LinCDE.boost(y = y.LGD, X = X, basis = "Gaussian", depth = 3, n.trees = 300, shrinkage = 0.02)
With a LinCDE boost model, we can predict conditional densities for an independent test dataset via the function predict and evaluate LinCDE boost’s performance. In the paper, we recommend the assessment metric of relative improvement in the test log-likelihood \[\begin{align*} \frac{\ell_{\text{LinCDE boost}} - \ell_{\text{null}}}{\ell_{\text{oracle}} - \ell_{\text{null}}}, \end{align*}\] where the null model is the universal Gaussian distribution with the response’s marginal mean and standard deviation, and the oracle denotes the true underlying conditional density. The criterion is analogous to the goodness-of-fit measure \(R^2\) of linear regression. In this example, LinCDE boost achieves \(80.9\%\) of the oracle’s improvement over the null model.
# LinCDE boost
predict.LGD.final = predict(model.LGD.final, X = XTest, y = yTest.LGD, densityOnly = FALSE)
# null model
model.LGD.null = LinCDE.boost(X = X, y = y.LGD, basis = "Gaussian", n.trees = 0)
predict.LGD.null = predict(model.LGD.null, X = XTest, y = yTest.LGD, densityOnly = FALSE)
# oracle
oracle = mean(log(density.LGD(X = XTest, y = yTest.LGD)))
# relative improvement
relativeImprovement = (predict.LGD.final$testLogLikelihood - predict.LGD.null$testLogLikelihood)/
  (oracle - predict.LGD.null$testLogLikelihood)
print(paste0("relative improvement: ", round(100 * relativeImprovement, 1), "%"))
#> [1] "relative improvement: 80.9%"
When the true density is unknown, the relative improvement cannot be computed. However, we can still use the absolute test log-likelihood to assess LinCDE boost’s performance. In addition, the visualization tools below help evaluate fitted LinCDE boost models.
For visualization, we plot the estimated conditional densities against the truth at the above \(9\) selected feature points. The estimated conditional densities are close to the truth.
# conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, model = model.LGD.final, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2], sep = ""))
To identify the influential covariates, we plot the importance scores of the covariates. The importance score barplot indicates that the first two covariates contribute the most to the improvement in the objective, which agrees with the underlying density model.
summary(model.LGD.final, cBars = 8)
#> var rel.inf
#> 1 X2 36.3531030
#> 2 X1 29.9086800
#> 3 X20 8.4665390
#> 4 X19 6.8705365
#> 5 X14 5.4844255
#> 6 X10 3.8044124
#> 7 X18 3.5250123
#> 8 X4 2.0696343
#> 9 X13 1.6067923
#> 10 X12 1.1105122
#> 11 X9 0.8003525
#> 12 X3 0.0000000
#> 13 X5 0.0000000
#> 14 X6 0.0000000
#> 15 X7 0.0000000
#> 16 X8 0.0000000
#> 17 X11 0.0000000
#> 18 X15 0.0000000
#> 19 X16 0.0000000
#> 20 X17 0.0000000
The above example shows LinCDE boost’s ability to capture location and scale changes. We add two more examples focusing on modality and skewness (asymmetry), respectively.
For modality, we generate locally Gaussian mixture responses if \(X_2 \le 0.2\) and locally Gaussian responses if \(X_2 > 0.2\). Meanwhile, we let the responses’ locations depend on \(X_1\). We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.05, depth = 3, n.trees = 200. We use a larger df (df = 8) to capture the bi-modality. The rest of the parameters are at default values.
# data generation parameters
density.modality = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 0.25 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 0.25 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}
set.seed(1)
y.modality = density.modality(X)
model.modality = LinCDE.boost(y = y.modality, X = X, df = 8,
depth = 3, n.trees = 200, shrinkage = 0.05)
We compare the estimated conditional densities against the truth at the \(9\) selected feature points. The estimated conditional densities are clearly bi-modal for \(X_2 = -0.6\) and \(X_2 = 0\). For \(X_2 = 0.6\), the estimated conditional densities are largely Gaussian with mild curvatures in the middle. In summary, LinCDE boost identifies \(X_1\) and \(X_2\) as the most important covariates correctly.
# conditional density plot
densityPlot(X = XProfile, trueDensity = density.modality, model = model.modality, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2], sep = ""))
summary(model.modality, cBars = 8)
#> var rel.inf
#> 1 X2 15.784887
#> 2 X1 12.961287
#> 3 X19 6.936327
#> 4 X15 6.250048
#> 5 X5 5.846550
#> 6 X18 5.766242
#> 7 X16 5.035929
#> 8 X17 4.901791
#> 9 X8 4.522908
#> 10 X4 4.473423
#> 11 X3 4.199582
#> 12 X12 4.082083
#> 13 X9 3.440578
#> 14 X14 3.146878
#> 15 X13 2.813074
#> 16 X10 2.644095
#> 17 X20 2.161298
#> 18 X11 2.025661
#> 19 X6 1.692540
#> 20 X7 1.314820
For skewness, we generate asymmetric Gaussian mixture responses. If \(X_2 > 0\), the left mode is sharper and the density is right-skewed; if \(X_2 < 0\), the right mode is sharper and the density is left-skewed. The larger \(|X_2|\) is, the more asymmetric the conditional distribution. Meanwhile, we let the responses’ locations depend on \(X_1\). We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.1, depth = 3, n.trees = 200, and df = 8. The remaining parameters are at default values.
We compare the estimated conditional densities against the truth at the \(9\) landmarks. The estimated conditional densities are right-skewed for \(X_2 = 0.6\), left-skewed for \(X_2 = -0.6\), and symmetric for \(X_2 = 0\). Covariates \(X_1\) and \(X_2\) are correctly identified as the most important.
# data generation parameters
density.skewness = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = 0.5 * dnorm(y, mean = 0.5 * X[, 1] + 0.8, sd = 0.5 + 0.45 * X[, 2]) +
      0.5 * dnorm(y, mean = 0.5 * X[, 1] - 0.8, sd = 0.5 - 0.45 * X[, 2])
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, 0.8, 0.5 + 0.45 * X[, 2]) + (1 - groupIndex) * rnorm(n, -0.8, 0.5 - 0.45 * X[, 2]) + 0.5 * X[, 1]
    return(y)
  }
}
set.seed(1)
y.skewness = density.skewness(X)
model.skewness = LinCDE.boost(X = X, y = y.skewness, df = 8,
depth = 3, n.trees = 200, shrinkage = 0.1)
# conditional density plot
densityPlot(X = XProfile, trueDensity = density.skewness, model = model.skewness, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2], sep = ""))
summary(model.skewness, cBars = 8)
#> var rel.inf
#> 1 X1 21.9539935
#> 2 X2 17.0168610
#> 3 X16 8.7661948
#> 4 X10 7.5742768
#> 5 X3 5.1747030
#> 6 X20 4.9460029
#> 7 X14 4.3826646
#> 8 X4 4.1571933
#> 9 X6 4.0269976
#> 10 X18 3.3791925
#> 11 X15 3.1927073
#> 12 X9 2.7585407
#> 13 X8 2.5577536
#> 14 X5 2.5184916
#> 15 X12 2.5166658
#> 16 X7 2.2210824
#> 17 X13 0.9773898
#> 18 X17 0.9410647
#> 19 X19 0.9382240
#> 20 X11 0.0000000
If a distribution’s conditional components differ dramatically in location, the response discretization in Lindsey’s method can be problematic: in any local area, only a few bins are effective, and the rest contain no observations. Such a grouping is coarse, and there are not enough effective degrees of freedom to capture the distributional properties. We call this the “disjoint support” problem.
We propose to solve the “disjoint support” problem by centering. In particular, we suggest aligning the centers of the conditional densities in advance: we first estimate the conditional means and subtract the estimates from the responses. The residuals’ supports are less heterogeneous, and we apply LinCDE boost to the residuals to capture distributional structure beyond the location.
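The centering step can be sketched independently of LinCDE (our own illustration with simulated data): a linear fit absorbs the location shift, and the residuals share a much narrower common support.

```r
# Centering: remove the estimated conditional mean before density estimation
set.seed(2)
n = 500
x = runif(n, -1, 1)
y = 4 * x + rnorm(n, 0, 0.5)          # responses with a strong location shift
fitMean = lm(y ~ x)                   # estimate the conditional mean
res = y - fitted(fitMean)             # centered responses
diff(range(y)); diff(range(res))      # the residual support is far narrower
```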
In the following example, we generate locally Gaussian mixture responses if \(X_2 \le 0.2\), and locally Gaussian responses if \(X_2 > 0.2\). Meanwhile, we let the responses’ supports differ dramatically as \(X_1\) changes, and thus the “disjoint support” problem is present.
We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.02, depth = 2, n.trees = 200 (the rest of the parameters are at default values). We compare LinCDE boost with and without centering. For centering, we use linear regression to estimate the conditional mean. In the relative influence plots, we stack the importances from centering (in red) and beyond centering (in blue). \(X_1\) accounts for most of the importance in centering, and \(X_2\) is the most important covariate beyond centering. Currently, the decomposition of importances into centering and beyond centering is only available for centeringMethod = “linearRegression” or centeringMethod = “randomForest”.
We compare the estimated conditional densities against the truth at \(9\) landmarks. Without centering, LinCDE boost identifies the location shift, but the predicted conditional densities are unimodal everywhere. With centering, LinCDE boost manages to produce the bi-modal structure for \(X_2 \le 0.2\).
# data generation parameters
density.centering = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 4 * X[, 1], sd = 0.8)
    index = which(X[, 2] <= 0.2)
    dens[index] = (0.5 * dnorm(y[index], mean = 4 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 4 * X[index, 1] - 0.8, sd = 0.3))
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 4 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
    return(y)
  }
}
set.seed(1)
y.centering = density.centering(X)
# without centering
model.centering.no = LinCDE.boost(X = X, y = y.centering, depth = 2, n.trees = 200, shrinkage = 0.02, splineDf = 10)
# with centering
model.centering.OLS = LinCDE.boost(X = X, y = y.centering, df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, centering = TRUE, centeringMethod = "linearRegression")
# without centering
summary(model.centering.no, cBars = 8)
#> var rel.inf
#> 1 X1 100
#> 2 X2 0
#> 3 X3 0
#> 4 X4 0
#> 5 X5 0
#> 6 X6 0
#> 7 X7 0
#> 8 X8 0
#> 9 X9 0
#> 10 X10 0
#> 11 X11 0
#> 12 X12 0
#> 13 X13 0
#> 14 X14 0
#> 15 X15 0
#> 16 X16 0
#> 17 X17 0
#> 18 X18 0
#> 19 X19 0
#> 20 X20 0
# without centering
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.no)
# with centering
summary(model.centering.OLS, cBars = 8)
#> $location
#> var rel.inf
#> 1 X1 48.67170894
#> 2 X2 1.00455822
#> 3 X4 1.25320551
#> 4 X15 1.35724808
#> 5 X8 0.78505622
#> 6 X19 1.01778142
#> 7 X5 0.11011730
#> 8 X7 0.70497903
#> 9 X18 0.97139414
#> 10 X12 0.26730585
#> 11 X3 0.53809259
#> 12 X17 0.61531201
#> 13 X6 0.43059357
#> 14 X11 0.35993610
#> 15 X20 0.28815834
#> 16 X9 0.20157656
#> 17 X14 0.17681822
#> 18 X16 0.10398081
#> 19 X13 0.09084538
#> 20 X10 0.05469914
#>
#> $beyondLocation
#> var rel.inf
#> 1 X1 0.9966768
#> 2 X2 13.9411148
#> 3 X4 6.9998472
#> 4 X15 4.4420828
#> 5 X8 2.8556823
#> 6 X19 2.5691209
#> 7 X5 3.4664395
#> 8 X7 2.5975351
#> 9 X18 1.3300843
#> 10 X12 1.2039489
#> 11 X3 0.5941000
#> 12 X17 0.0000000
#> 13 X6 0.0000000
#> 14 X11 0.0000000
#> 15 X20 0.0000000
#> 16 X9 0.0000000
#> 17 X14 0.0000000
#> 18 X16 0.0000000
#> 19 X13 0.0000000
#> 20 X10 0.0000000
# with centering
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.OLS, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2], sep = ""))
We remark that users can also input a function as centeringMethod: LinCDE.boost calls centeringMethod(y, X) to fit the centering model and applies the generic predict function to that model at prediction time. Below we define centeringMethod to be standard linear regression and feed it to LinCDE.boost. The resulting model model.centering.DIY should be the same as with centeringMethod = “linearRegression”, i.e., the model model.centering.OLS above.
meanMethod = function(y, X){
  meanModelData = as.data.frame(cbind(y, X))
  model = lm(y ~ ., data = meanModelData)
  return(model)
}
model.centering.DIY = LinCDE.boost(X = X, y = y.centering, df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, centering = TRUE, centeringMethod = meanMethod)
We read in the California Housing data, remove winsorized responses, and divide the responses by \(1000\). For time efficiency, we randomly select \(1000\) observations as the training data.
cahousing = read.csv("./data/cahousing.csv", header = TRUE)
y.cahousing = cahousing[, 1]; cahousing = cahousing[y.cahousing < max(y.cahousing), ]
y.cahousing = cahousing[, 1] / 1000
X = data.matrix(cahousing[, -1])
set.seed(318); indexTrain = sample(length(y.cahousing), 1000, replace = FALSE)
We start with conditional mean estimation. We fit a simple gbm model on the training data. The covariate most influential on the conditional mean is median income (MedInc), followed by longitude (Longitude), average occupancy (AveOccup), and latitude (Latitude).
meanModel.cahousing = gbm::gbm.fit(y = y.cahousing[indexTrain], x = X[indexTrain, ], distribution = "gaussian", interaction.depth = 3, n.trees = 1000, shrinkage = 0.02, verbose = F)
summary(meanModel.cahousing)
#> var rel.inf
#> MedInc MedInc 46.974127
#> Longitude Longitude 13.310592
#> AveOccup AveOccup 12.341893
#> Latitude Latitude 11.317929
#> HouseAge HouseAge 5.119188
#> AveRooms AveRooms 4.385236
#> AveBedrms AveBedrms 3.288512
#> Population Population 3.262522
Next, we fit a simple LinCDE boost model with sufficient statistics \(y\) and \(y^2\) (basis = “Gaussian”) as a start. Compared to the conditional mean estimation above, the LinCDE boost model also incorporates variance information. The importances are similar to those from the gbm model: MedInc is the dominant covariate, followed by AveOccup, Longitude, and Latitude. Let us plot the conditional density estimates for different (MedInc, AveOccup) pairs. Since we do not know the underlying truth, we plot the histograms of the samples with similar covariate values as a reference. (The similarity of a sample to a target point is measured by the covariate Mahalanobis distance weighted by the covariate importances from the LinCDE boost model.)
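Such an importance-weighted distance can be sketched as follows (our own illustration; the covariate matrix X0, the weights w, and the target point are hypothetical, not the package’s code): standardize the covariates to remove scale differences, then weight each coordinate’s squared distance by its importance score.

```r
# Importance-weighted distance for selecting similar observations
set.seed(1)
X0 = matrix(rnorm(100 * 3), ncol = 3)        # hypothetical covariate matrix
w = c(0.6, 0.3, 0.1)                         # hypothetical importance weights
target = c(0.5, 0, 0)                        # hypothetical feature point
Z = scale(X0)                                # standardize the covariates
zTarget = (target - attr(Z, "scaled:center")) / attr(Z, "scaled:scale")
dist2 = colSums(w * (t(Z) - zTarget)^2)      # weighted squared distances
neighbors = order(dist2)[1:20]               # 20 most similar observations
```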
model.cahousing.Gaussian = LinCDE.boost(y = y.cahousing[indexTrain], X = X[indexTrain, ], basis = "Gaussian", depth = 3, n.trees = 150, terminalSize = 50, shrinkage = 0.05, minY = 0, maxY = 510)
summary(model.cahousing.Gaussian)
#> var rel.inf
#> 1 MedInc 30.488859
#> 2 AveOccup 16.912307
#> 3 Longitude 14.825653
#> 4 Latitude 11.179810
#> 5 HouseAge 8.989265
#> 6 Population 6.436688
#> 7 AveRooms 6.201678
#> 8 AveBedrms 4.965740
X.MedInc = quantile(X[, "MedInc"], c(.2, .5, .8)) # 20%, 50%, 80% quantiles of MedInc
X.AveOccup = quantile(X[, "AveOccup"], c(.2, .5, .8)) # 20%, 50%, 80% quantiles of AveOccup
XGrid = matrix(colMeans(X), byrow = TRUE, nrow = 9, ncol = ncol(X)); colnames(XGrid) = colnames(X) # other covariates fixed at the sample means
XGrid[, c("MedInc", "AveOccup")] = data.matrix(expand.grid(X.MedInc, X.AveOccup))
yGrid = seq(0, 510, length.out = 100) # response grid
plot.cahousing.Gaussian = densityPlot(X = XGrid, yGrid = yGrid, model = model.cahousing.Gaussian, plot = FALSE)
The estimated conditional densities agree with the histograms in their dependence on income and occupancy: the conditional mean increases with income (from left to right), and in the middle and right columns the conditional variance decreases as occupancy grows (from bottom to top). However, the histograms do not look Gaussian, so let us consider a more flexible basis for LinCDE boost.
Below we use \(10\) transformed natural cubic spline bases with \(6\) degrees of freedom for the ridge regularization.
model.cahousing.ns = LinCDE.boost(y = y.cahousing[indexTrain], X = X[indexTrain, ], basis = "nsTransform", depth = 3, splineDf = 10, df = 6, n.trees = 300, terminalSize = 50, shrinkage = 0.05, minY = 0, maxY = 510, verbose = TRUE)
summary(model.cahousing.ns)
#> var rel.inf
#> 1 MedInc 31.297311
#> 2 Longitude 14.588574
#> 3 AveOccup 14.468606
#> 4 Latitude 14.453976
#> 5 AveRooms 10.808165
#> 6 HouseAge 7.027820
#> 7 AveBedrms 4.317568
#> 8 Population 3.037978
The importances look similar. We plot the estimated conditional densities at the same covariate configurations. As above, the conditional density’s location shifts to the right as income increases, and its spread decreases as occupancy grows in the middle and right columns. Unlike above, the estimated conditional densities take diverse shapes beyond Gaussian. In regions with median income \(2.32\) (\(20\%\) quantile) or \(3.45\) (\(50\%\) quantile), the conditional distributions appear right-skewed. In regions with median income \(3.45\) (\(50\%\) quantile) and average occupancy \(2.36\) (\(20\%\) quantile) or \(2.84\) (\(50\%\) quantile), the conditional distributions have flatter peaks than the Gaussian distribution.
plot.cahousing.ns = densityPlot(X=XGrid, yGrid = yGrid, model = model.cahousing.ns, plot = FALSE)